
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Henderson, Peter

Neural Information Processing Systems

Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material.






Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding

Gartner, Mike

arXiv.org Artificial Intelligence

U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding and to improve access to justice and healthcare. Yet existing corpora lack the context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals, designed to support regulatory and patient self-help applications, and we release a labeled benchmark for the task along with models trained on it.


Utah's High-Stakes PR Campaign to Wrest Control of Public Lands

Mother Jones

Utah Attorney General Sean Reyes speaks at the Utah State Capitol in Salt Lake City last year after state leaders announced they are suing the federal government over 18.5 million acres of Bureau of Land Management land, which covers about 34% of Utah. Saige Miller / KUER via High Country News

This story was originally published by High Country News and is reproduced here as part of the Climate Desk collaboration. Last year, as Utah prepared to file a federal lawsuit aiming to take control of millions of acres of federal public land within its borders, state officials sought help swaying public opinion in their favor. So they turned to a group of public relations professionals at Penna Powers, a media and branding firm based in Salt Lake City. Backed by a commitment of more than $2 million in taxpayer funds, the firm sprang into action. One of its early orders of business was studying the opposition. In June 2024, an assistant attorney general sent an email to numerous state government colleagues and Penna Powers staffers containing a video from the Theodore Roosevelt Conservation Partnership (TRCP) in which the well-known hunter and media personality Randy Newberg described the dangers of transferring federal land to state control.


Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Langlais, Pierre-Carl, Hinostroza, Carlos Rosas, Nee, Mattia, Arnett, Catherine, Chizhov, Pavel, Jones, Eliot Krzystof, Girard, Irène, Mach, David, Stasenko, Anastasia, Yamshchikov, Ivan P.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the use of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with data-security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets, and it includes a large portion of code data. The diversity of its sources in terms of domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of the data assembly and the details of dataset filtering and curation. Already used by industry leaders such as Anthropic and by multiple LLM training projects, Common Corpus, we believe, will become critical infrastructure for open science research in LLMs.


The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Bommarito, Michael J II, Bommarito, Jillian, Katz, Daniel Martin

arXiv.org Artificial Intelligence

Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks related to copyright or breach of contract. The foundation of this project is a corpus of over 132 million documents and trillions of tokens spanning 16 different sources that have been verified to meet the strict copyright and licensing protocol detailed herein. We are releasing the entire pipeline, including 1) the source code to acquire and process these documents, 2) the original document formats with associated provenance and metadata, 3) extracted content in a standardized format, 4) pre-tokenized representations of the documents, and 5) various mid- and post-train resources such as question-answer, summarization, conversion, drafting, classification, prediction, and conversational data. All of these resources are freely available to the public on S3, Hugging Face, and GitHub under CC-BY terms. We are committed to continuing this project in furtherance of a more ethical, legal, and sustainable approach to the development and use of AI models.


China's DeepSeek impresses. But is a 'fast follow' good enough in AI?

Christian Science Monitor | Science

American stock markets shuddered on Monday, prompted by China's announcement that it has created a capable, cheap artificial intelligence model. It's the biggest cloud yet to darken the West's blue-sky enthusiasm over AI, calling into question the efficacy of America's export controls and the billions of dollars the United States is pouring into the technology's expensive cutting edge. Chinese startup DeepSeek says its AI assistant uses less advanced chips than its rivals' models do, and it costs less to train. Unlike the West's billions, the Chinese model was developed for just $5.6 million, by one estimate. "Are we going to spend $500 billion to get to the frontier so that China can find a way to copy our homework for pennies on the dollar?"


Towards Best Practices for Open Datasets for LLM Training

Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas

arXiv.org Artificial Intelligence

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in places like the EU and Japan, it is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend, among both corporate and public interest actors, toward minimizing the information shared about training datasets. This trend harms the broader ecosystem by hindering transparency, accountability, and innovation, and by denying researchers, auditors, and impacted individuals the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models trained at a meaningful scale, due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building toward a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.